Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support determining mode based on shebang interpreter directive #47

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

iainh
Copy link
Contributor

@iainh iainh commented Jul 18, 2022

Attempt to determine the appropriate mode based on the presence of an interpreter directive on the first line of the file. Interpreter directives are examined first, followed by the filename when determining the mode mirroring what Emacs does by default.

Reopening now that highlight-with-queries has been merged. Feel free to close if there is no interest in supporting this feature.

Attempt to determine the appropriate mode based on the presence of
an interpreter directive on the first line of the file. Interpreter
directives are examined first, followed by the filename when
determining the mode mirroring what Emacs does by default.
Copy link
Collaborator

@mcobzarenco mcobzarenco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really awesome! Thanks so much, it's a very useful addition I wished I had before.

Sorry for the multiple comments and being pedantic about not loading the whole file, happy to help with any of them

@@ -17,6 +17,8 @@ pub struct ModeConfig {
pub comment: Option<CommentConfig>,
pub indentation: IndentationConfig,
pub grammar: Option<GrammarConfig>,
#[serde(default)]
pub shebangs: Vec<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking whether this should be a new field, rather than a new variant of FilenamePattern -- granted, we should probably rename it to FilePattern as it will look inside the file too to determine if the mode applies. I.e. something like FilePattern::Shebang.

The structure I suggest may make it harder to avoid reading the file when the filename would suffice.

Comment on lines 176 to 184
let mode = text
.line(0)
.as_str()
.and_then(|shebang| context.0.mode_by_shebang(shebang))
.or_else(|| {
file_path
.as_ref()
.and_then(|path| context.0.mode_by_filename(path))
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may potentially read the whole file in pathological cases, e.g. a minified file that has no new lines. One goal of zee is to keep being fast for any kind of pathological file you can think of and do anything that is potentially blocking in the background (e.g. parsing syntax or writing the file to disk). I've also been trying to avoid doing anything linear in the length of a line in the UI thread.

I think the right solution here long term is to build buffers, i.e. call Buffer::new() in a background thread, rather than in the main, UI thread.

For this PR though, I'd be happy if instead you just bound how much of the file you read, say 256 bytes at most and test the regex for that. You'll have to deal with potentially truncated utf-8...

Maybe a better solution is to read characters until you encounter either 1. a new line or 2. if you don't after X characters, you give you and don't check the shebang -- essentially we only test if shebangs apply up to a certain line length.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 2nd comment is that I think we should test if "mode_by_filename" applies and only then check the shebangs to avoid having to read the file if the name already matches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions. I'm looking to see if there is any standards around whether white space is allowed before the #! and what the maximum length is on most platforms. I have some outdated information that 127-512 bytes is the maximum but with a lack of a standard to point at, I think an overly large maximum might be best. FreeBSD for example historically supported 4096. If white space before the shebang is not allowed then a two pass strategy where only the first two characters are examined followed by the remainder of the line up to the maximum discussed might be the most efficient.

Comment on lines 97 to 106
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> &Mode {
pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> Option<&Mode> {
self.modes
.iter()
.find(|&mode| mode.matches_by_filename(filename.as_ref()))
.unwrap_or(&PLAIN_TEXT_MODE)
}

pub fn mode_by_shebang(&self, shebang: &str) -> Option<&Mode> {
self.modes
.iter()
.find(|&mode| mode.matches_by_shebang(shebang))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If shebang is a variant of FilenamePattern, there would be one function and we could continue to return &Mode rather than Option<&Mode> -- I would like to continue have Context own coming up with a Mode for any possible file whatsoever and not have a default PLAIN_TEXT_MODE potentially duplicated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about merging these methods into something like mode_by_file_pattern that would always return a &Mode but I'm stuck on how to handle the difference in parameters. Two of the variants operate on the filename while the other needs a portion of the file content. To have one method we would have to always pass a portion of the file content to the function in addition to the filename which would require reading at least part of the content of every file.

All of that being said, prior to detecting the mode or creating the buffer open_file() is calling Rope::from_reader() which I think might be reading the entire file anyway.

Rope::from_reader(BufReader::new(File::open(&file_path)?))?,
I reading that right? If so we already have the Rope created and available in Buffer::new() so reading a slice should be quite efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an updated version of this code with a limit on the length of text examined for the interpreter directive, an inversion of the order mode checks are performed (filename checks first, then shebang), and a merging of the mode_by_filename() and mode_by_shebang() methods into one, mode_by_file(), which always returns a &Mode.

I ended up settling on 256 characters for the maximum length of the shebang directive, matching what linux has done since 2018 (https://lore.kernel.org/lkml/20181112160956.GA28472@redhat.com/) (https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h)

I haven’t found a satisfactory way of adding a shebang case to FilenamePattern, the biggest stumbling blocks being that the shebang line would need to be passed to the matches() method and the pattern list would need to be sorted/filtered if we wanted to ensure that filename patterns were always examined before shebang directives.

1) Constrain the number of characters examine when determining the
   interpreter directive to 256 to avoid performance penalty of scanning
   long lines unnecessarily.
2) Move to a single mode_by_file() function in Context.
3) Always attempt to match the mode based on the filename first,
   falling back to the shebang line only if no match is found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants